read data
df <- read.csv("~/Desktop/BU-Master Courses/2nd Sem Classes/AD 699 A1 Data Mining /AD 699 Group Project/copenhagen_listings.csv", header=TRUE)
id
df <- df %>%
mutate(
id = as.character(id),
scrape_id = as.character(scrape_id),
host_id = as.character(host_id)
)
str(df[, c("id", "scrape_id", "host_id")])
## 'data.frame': 21707 obs. of 3 variables:
## $ id : chr "31094" "32379" "32841" "38499" ...
## $ scrape_id: chr "20241230011627" "20241230011627" "20241230011627" "20241230011627" ...
## $ host_id : chr "129976" "140105" "142143" "122489" ...
We convert the id, scrape_id, and host_id columns in our dataset to character type because these variables represent identifiers rather than numeric values with mathematical meaning. After the conversion, we check the structure of these columns to confirm that the data types have been correctly updated.
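A quick illustration of why numeric identifiers mislead: a 14-digit scrape_id stored as a number prints in scientific notation by default, while the character form stays intact.

```r
scrape_id_chr <- "20241230011627"
scrape_id_num <- as.numeric(scrape_id_chr)

print(scrape_id_num)                      # 2.024123e+13 under default print options
format(scrape_id_num, scientific = FALSE) # "20241230011627"
nchar(scrape_id_chr)                      # 14 -- the character form keeps every digit
```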
date
df <- df %>%
mutate(
last_scraped = ymd(last_scraped),
host_since = ymd(host_since),
calendar_last_scraped = ymd(calendar_last_scraped),
first_review = ymd(first_review),
last_review = ymd(last_review)
)
str(df[, c("last_scraped", "host_since", "calendar_last_scraped", "first_review", "last_review")])
## 'data.frame': 21707 obs. of 5 variables:
## $ last_scraped : Date, format: "2024-12-30" "2024-12-30" ...
## $ host_since : Date, format: "2010-05-22" "2010-06-07" ...
## $ calendar_last_scraped: Date, format: "2024-12-30" "2024-12-30" ...
## $ first_review : Date, format: "2010-08-16" "2010-08-23" ...
## $ last_review : Date, format: "2022-08-22" "2024-10-28" ...
We convert the last_scraped, host_since, calendar_last_scraped, first_review, and last_review columns to date format using the ymd() function because these variables represent specific points in time rather than plain text or numbers. We do this to ensure that any future operations involving dates, such as calculating durations or filtering listings by time periods, are handled correctly and efficiently.
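Once stored as Date objects, these columns support arithmetic directly; a minimal sketch on toy vectors (not the actual df) of the kind of duration calculation this enables:

```r
library(lubridate)

host_since  <- ymd(c("2010-05-22", "2010-06-07"))
last_review <- ymd(c("2022-08-22", "2024-10-28"))
scrape_date <- ymd("2024-12-30")

# Host tenure in years at the scrape date
tenure_years <- as.numeric(scrape_date - host_since) / 365.25

# Days elapsed since the most recent review
days_since_review <- as.numeric(scrape_date - last_review)
```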
num
df <- df %>%
mutate(
host_response_rate = as.numeric(gsub("%", "", host_response_rate)) / 100,
host_acceptance_rate = as.numeric(gsub("%", "", host_acceptance_rate)) / 100,
price = as.numeric(gsub("[$,]", "", price))
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `host_response_rate = as.numeric(gsub("%", "",
## host_response_rate))/100`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
str(df[, c("host_response_rate", "host_acceptance_rate", "price")])
## 'data.frame': 21707 obs. of 3 variables:
## $ host_response_rate : num NA 1 NA 1 1 NA NA NA 1 NA ...
## $ host_acceptance_rate: num NA 1 NA 1 0.65 0.89 0.92 1 0.57 0.5 ...
## $ price : num NA NA NA 3000 2200 ...
We clean and convert the host_response_rate, host_acceptance_rate, and price columns to numeric values to prepare them for quantitative analysis. We strip non-numeric characters such as percentage signs and currency symbols, then convert the cleaned strings to numbers. For the response and acceptance rates, we further divide by 100 to express them in decimal form rather than as percentages, making them easier to use in calculations and modeling.
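The "NAs introduced by coercion" warning above is expected rather than a problem: raw entries with no numeric form simply become NA. A minimal reproduction, assuming the raw columns contain literal "N/A" strings:

```r
raw <- c("100%", "N/A", "86%")
# "N/A" has no numeric form, so as.numeric() coerces it to NA (with a warning)
rates <- suppressWarnings(as.numeric(gsub("%", "", raw)) / 100)
rates  # 1.00, NA, 0.86
```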
bool
df <- df %>%
mutate(
host_is_superhost = ifelse(host_is_superhost == "t", TRUE,
ifelse(host_is_superhost == "f", FALSE, NA)),
host_has_profile_pic = ifelse(host_has_profile_pic == "t", TRUE,
ifelse(host_has_profile_pic == "f", FALSE, NA)),
host_identity_verified = ifelse(host_identity_verified == "t", TRUE,
ifelse(host_identity_verified == "f", FALSE, NA)),
instant_bookable = ifelse(instant_bookable == "t", TRUE,
ifelse(instant_bookable == "f", FALSE, NA)),
has_availability = ifelse(has_availability == "t", TRUE,
ifelse(has_availability == "f", FALSE, NA))
)
str(df[, c("host_is_superhost", "host_has_profile_pic", "host_identity_verified", "instant_bookable", "has_availability")])
## 'data.frame': 21707 obs. of 5 variables:
## $ host_is_superhost : logi FALSE TRUE FALSE TRUE TRUE TRUE ...
## $ host_has_profile_pic : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ host_identity_verified: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ instant_bookable : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ has_availability : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
We transform the host_is_superhost, host_has_profile_pic, host_identity_verified, instant_bookable, and has_availability columns into logical values by recoding entries labeled “t” as TRUE and “f” as FALSE, while leaving any other unexpected or missing entries as NA to preserve data integrity.
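The five ifelse() chains repeat the same "t"/"f" mapping; the same recode can be written once and applied with across(). A sketch on toy data (the helper name to_logical is ours):

```r
library(dplyr)

# One shared recode: "t" -> TRUE, "f" -> FALSE, anything else -> NA
to_logical <- function(x) {
  case_when(x == "t" ~ TRUE, x == "f" ~ FALSE, TRUE ~ NA)
}

toy <- tibble(
  host_is_superhost = c("t", "f", ""),
  instant_bookable  = c("f", "t", "t")
)

toy <- toy %>%
  mutate(across(c(host_is_superhost, instant_bookable), to_logical))
```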
drop columns that are mostly NA
df <- df %>%
dplyr::select(-neighbourhood_group_cleansed, -calendar_updated, -license)
We remove the neighbourhood_group_cleansed, calendar_updated, and license columns from the dataset because inspecting the data shows they are almost entirely NA. They add nothing to our intended analysis, and removing them reduces clutter and keeps the focus on meaningful variables.
handle bathrooms and bathrooms_text
df <- df %>%
mutate(
# If bathrooms is NA, fill it from bathrooms_text; if both are NA, keep NA
bathrooms = ifelse(is.na(bathrooms) & is.na(bathrooms_text), NA,
ifelse(is.na(bathrooms), as.numeric(str_extract(bathrooms_text, "\\d+\\.?\\d*")), bathrooms)),
# If bathrooms_text is NA, rebuild it from bathrooms; if both are NA, keep NA
bathrooms_text = ifelse(is.na(bathrooms_text) & is.na(bathrooms), NA,
ifelse(is.na(bathrooms_text), paste0(bathrooms, " bath", ifelse(bathrooms > 1, "s", "")), bathrooms_text))
)
unique(df$bathrooms_text)
## [1] "1.5 baths" "2 baths" "1 bath"
## [4] "1 shared bath" "3 baths" "2 shared baths"
## [7] "2.5 baths" "" "Half-bath"
## [10] "0 baths" "1 private bath" "Private half-bath"
## [13] "3.5 baths" "1.5 shared baths" "5 baths"
## [16] "Shared half-bath" "0 shared baths" "4 baths"
## [19] "2.5 shared baths" "3 shared baths" "8 baths"
df <- df %>%
mutate(
bathrooms = case_when(
bathrooms_text %in% c("Half-bath", "Private half-bath", "Shared half-bath") ~ 0.5,
TRUE ~ as.numeric(str_extract(bathrooms_text, "\\d+\\.?\\d*")) # extract the number and convert
)
)
str(df[, c("bathrooms", "bathrooms_text")])
## 'data.frame': 21707 obs. of 2 variables:
## $ bathrooms : num 1.5 2 1 1 2 1 1.5 1 1.5 1 ...
## $ bathrooms_text: chr "1.5 baths" "2 baths" "1 bath" "1 bath" ...
We clean and harmonize the bathrooms and bathrooms_text columns to ensure they are consistent. We do this by first filling missing bathrooms values using the numeric information extracted from bathrooms_text, and vice versa, ensuring that neither field is unnecessarily left blank if useful information is available. Then, we further refine the bathrooms column by assigning a value of 0.5 to listings described as having a “half-bath” and extracting numerical values from other bathroom descriptions. This approach improves the quality and completeness of our bathroom data while handling special cases like half-baths appropriately.
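A small sketch of how the extraction regex behaves on a few of the observed bathrooms_text values:

```r
library(stringr)
library(dplyr)

txt <- c("1.5 baths", "2 shared baths", "Half-bath", "0 baths")

# Half-bath variants get 0.5; otherwise the leading number is extracted
bathrooms <- case_when(
  txt %in% c("Half-bath", "Private half-bath", "Shared half-bath") ~ 0.5,
  TRUE ~ as.numeric(str_extract(txt, "\\d+\\.?\\d*"))
)
bathrooms  # 1.5 2.0 0.5 0.0
```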
handle remaining NAs and empty strings
df <- df %>%
mutate(across(where(is.character), ~ ifelse(is.na(.) | . == "", "Unknown", .)))
df <- df %>%
mutate(across(where(is.numeric), ~ ifelse(. == "", NA, .)))
We standardize the remaining missing values by replacing any NA or empty string in character columns with the label "Unknown," so that missing text is clearly flagged rather than left blank. For numeric columns, we replace any empty entries with NA, since a numeric field should either hold a valid number or be explicitly missing.
We do not delete rows that contain NA values: a row with an NA in one column still carries usable information in its other columns, so dropping the entire row would discard data we can analyze. Likewise, we do not impute NAs with the mean, median, or similar, as we believe this would distort the original data. In subsequent analyses we extract the columns we need and drop the NAs from those, without altering the dataset itself.
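In practice, this per-analysis handling looks like the following sketch (toy data standing in for df): the working copy is cleaned, while the dataset itself is untouched.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  price    = c(1000, NA, 1600),
  bedrooms = c(1, 2, NA),
  name     = c("a", "b", "c")
)

# Pull only the columns this analysis needs and drop NAs there;
# `toy` itself keeps all three rows untouched.
analysis_df <- toy %>%
  dplyr::select(price, bedrooms) %>%
  drop_na()
```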
# Select all numeric columns
numeric_cols <- df %>% dplyr::select(where(is.numeric))
summary(numeric_cols)
## host_response_rate host_acceptance_rate host_listings_count
## Min. :0.000 Min. :0.000 Min. : 1.000
## 1st Qu.:1.000 1st Qu.:0.430 1st Qu.: 1.000
## Median :1.000 Median :0.750 Median : 1.000
## Mean :0.887 Mean :0.671 Mean : 6.002
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :1.000 Max. :3284.000
## NA's :10524 NA's :4663 NA's :2
## host_total_listings_count latitude longitude accommodates
## Min. : 1.0 Min. :55.62 Min. :12.45 Min. : 1.000
## 1st Qu.: 1.0 1st Qu.:55.67 1st Qu.:12.54 1st Qu.: 2.000
## Median : 1.0 Median :55.68 Median :12.55 Median : 3.000
## Mean : 12.5 Mean :55.68 Mean :12.56 Mean : 3.309
## 3rd Qu.: 2.0 3rd Qu.:55.70 3rd Qu.:12.58 3rd Qu.: 4.000
## Max. :8065.0 Max. :55.73 Max. :12.64 Max. :16.000
## NA's :2
## bathrooms bedrooms beds price
## Min. :0.000 Min. :0.00 Min. : 0.00 Min. : 72
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.: 1.00 1st Qu.: 895
## Median :1.000 Median :1.00 Median : 1.00 Median : 1150
## Mean :1.092 Mean :1.61 Mean : 1.82 Mean : 1390
## 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.: 2.00 3rd Qu.: 1600
## Max. :8.000 Max. :8.00 Max. :18.00 Max. :100000
## NA's :10 NA's :712 NA's :9223 NA's :9236
## minimum_nights maximum_nights minimum_minimum_nights
## Min. : 1.000 Min. : 1.0 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 20.0 1st Qu.: 2.000
## Median : 3.000 Median : 150.0 Median : 3.000
## Mean : 4.567 Mean : 328.3 Mean : 4.305
## 3rd Qu.: 4.000 3rd Qu.: 365.0 3rd Qu.: 4.000
## Max. :1111.000 Max. :1125.0 Max. :1111.000
##
## maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights
## Min. : 1.000 Min. : 1.0 Min. : 1.0
## 1st Qu.: 2.000 1st Qu.: 21.0 1st Qu.: 21.0
## Median : 3.000 Median : 365.0 Median : 365.0
## Mean : 4.884 Mean : 395.3 Mean : 398.2
## 3rd Qu.: 4.000 3rd Qu.: 730.0 3rd Qu.: 730.0
## Max. :1111.000 Max. :1125.0 Max. :1125.0
##
## minimum_nights_avg_ntm maximum_nights_avg_ntm availability_30 availability_60
## Min. : 1.000 Min. : 1.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 21.0 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 3.000 Median : 365.0 Median : 0.000 Median : 0.00
## Mean : 4.607 Mean : 396.7 Mean : 7.666 Mean :16.88
## 3rd Qu.: 4.000 3rd Qu.: 730.0 3rd Qu.:17.000 3rd Qu.:35.00
## Max. :1111.000 Max. :1125.0 Max. :30.000 Max. :60.00
##
## availability_90 availability_365 number_of_reviews number_of_reviews_ltm
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 0.000
## Median : 1.00 Median : 17.00 Median : 8.00 Median : 2.000
## Mean :26.36 Mean : 90.85 Mean : 18.49 Mean : 5.162
## 3rd Qu.:58.00 3rd Qu.:166.00 3rd Qu.: 19.00 3rd Qu.: 6.000
## Max. :90.00 Max. :365.00 Max. :2138.00 Max. :569.000
##
## number_of_reviews_l30d review_scores_rating review_scores_accuracy
## Min. : 0.0000 Min. :0.000 Min. :1.00
## 1st Qu.: 0.0000 1st Qu.:4.760 1st Qu.:4.80
## Median : 0.0000 Median :4.920 Median :4.92
## Mean : 0.2568 Mean :4.835 Mean :4.85
## 3rd Qu.: 0.0000 3rd Qu.:5.000 3rd Qu.:5.00
## Max. :87.0000 Max. :5.000 Max. :5.00
## NA's :2630 NA's :2632
## review_scores_cleanliness review_scores_checkin review_scores_communication
## Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:4.600 1st Qu.:4.83 1st Qu.:4.900
## Median :4.820 Median :4.96 Median :5.000
## Mean :4.715 Mean :4.88 Mean :4.917
## 3rd Qu.:5.000 3rd Qu.:5.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.00 Max. :5.000
## NA's :2632 NA's :2632 NA's :2632
## review_scores_location review_scores_value calculated_host_listings_count
## Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:4.770 1st Qu.:4.630 1st Qu.: 1.000
## Median :4.910 Median :4.780 Median : 1.000
## Mean :4.845 Mean :4.726 Mean : 4.297
## 3rd Qu.:5.000 3rd Qu.:4.940 3rd Qu.: 1.000
## Max. :5.000 Max. :5.000 Max. :223.000
## NA's :2633 NA's :2632
## calculated_host_listings_count_entire_homes
## Min. : 0.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 4.086
## 3rd Qu.: 1.000
## Max. :223.000
##
## calculated_host_listings_count_private_rooms
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.2038
## 3rd Qu.: 0.0000
## Max. :16.0000
##
## calculated_host_listings_count_shared_rooms reviews_per_month
## Min. :0.000000 Min. : 0.0100
## 1st Qu.:0.000000 1st Qu.: 0.2000
## Median :0.000000 Median : 0.4100
## Mean :0.005574 Mean : 0.7131
## 3rd Qu.:0.000000 3rd Qu.: 0.8300
## Max. :7.000000 Max. :82.6000
## NA's :2630
Finally, we reviewed all of the numeric columns to confirm there were no impossible values. We did find some extreme values, such as a very high price (DKK 100,000) and a very long maximum stay (1,125 nights), but we judged these plausible: a host may list a large luxury villa at a very high price, and some guests may genuinely want long stays. Checking the columns one by one, we found no impossible values such as negative prices, review scores outside the 0-5 range, or latitude and longitude coordinates falling somewhere outside Copenhagen. In our view every column is reasonable; some values are extreme, but none are inexplicable.
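These sanity checks can also be scripted as assertions; a sketch on toy vectors, where the Copenhagen bounding-box values are our rough approximation:

```r
prices <- c(72, 1150, 100000)
scores <- c(0, 4.76, 5.00)
lat    <- c(55.62, 55.68, 55.73)
lon    <- c(12.45, 12.55, 12.64)

stopifnot(all(prices > 0, na.rm = TRUE))                 # no zero or negative prices
stopifnot(all(scores >= 0 & scores <= 5, na.rm = TRUE))  # ratings stay on the 0-5 scale
stopifnot(all(lat > 55.5 & lat < 55.8, na.rm = TRUE))    # roughly Copenhagen's latitude band
stopifnot(all(lon > 12.3 & lon < 12.8, na.rm = TRUE))    # roughly Copenhagen's longitude band
```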
summary_stats <- df %>%
group_by(neighbourhood_cleansed) %>%
summarise(
listing_count = n(), # number of listings per neighbourhood
avg_price = mean(price, na.rm = TRUE), # mean price
median_price = median(price, na.rm = TRUE), # median price
avg_review_score = mean(review_scores_rating, na.rm = TRUE), # mean review score
superhost_rate = mean(host_is_superhost == TRUE, na.rm = TRUE), # share of superhosts
instant_bookable_rate = mean(instant_bookable == TRUE, na.rm = TRUE), # share of instant-bookable listings
avg_minimum_nights = mean(minimum_nights, na.rm = TRUE) # mean minimum nights
) %>%
arrange(desc(listing_count))
summary_stats
In our analysis of Copenhagen’s Airbnb listings by neighborhood, we found that Nørrebro has the highest number of listings (3896), followed closely by Vesterbro-Kongens Enghave (3734) and Indre By (3073). This suggests that these areas are major hubs for short-term rentals. In terms of price, Indre By clearly stands out with an average listing price of DKK 1899.76 and a median price of DKK 1600, significantly higher than other neighborhoods. In contrast, Bispebjerg has one of the lowest average prices at DKK 953.19. This price gap indicates that staying in the city center (Indre By) comes at a premium, likely reflecting its proximity to major attractions.
Looking at host and booking characteristics, we noticed that Indre By also has the highest superhost rate (18.35%) and the highest instant bookable rate (15.36%), suggesting a more professional and tourist-ready host base compared to other areas like Nørrebro and Østerbro, where the superhost rates are closer to 10%. Review scores were relatively high and consistent across neighborhoods, all clustering around 4.8 out of 5, indicating strong guest satisfaction city-wide. Finally, when considering minimum stay requirements, Brønshøj-Husum had the longest average minimum stay at 5.54 nights, while Vesterbro-Kongens Enghave had the shortest at 4.2 nights, hinting at different hosting strategies depending on location, possibly with more residential areas encouraging longer stays.
ggplot(df %>% count(neighbourhood_cleansed),
aes(x = reorder(neighbourhood_cleansed, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Number of Listings per Neighbourhood",
x = "Neighbourhood", y = "Listing Count") +
theme_minimal()
ggplot(df, aes(x = neighbourhood_cleansed, y = price)) +
geom_boxplot(outlier.color = "red", fill = "skyblue") +
coord_flip() +
labs(title = "Price Distribution by Neighbourhood",
x = "Neighbourhood", y = "Price") +
theme_minimal()
## Warning: Removed 9236 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot(df, aes(x = longitude, y = latitude, color = neighbourhood_cleansed)) +
geom_point(alpha = 0.5, size = 1) +
theme_minimal() +
labs(title = "Copenhagen Airbnb Listings by Area",
x = "Longitude", y = "Latitude", color = "Neighbourhood")
df %>%
group_by(neighbourhood_cleansed) %>%
summarise(superhost_rate = mean(host_is_superhost == TRUE, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(neighbourhood_cleansed, superhost_rate), y = superhost_rate)) +
geom_col(fill = "orange") +
coord_flip() +
labs(title = "Superhost Ratio by Neighbourhood",
x = "Neighbourhood", y = "Superhost Rate") +
theme_minimal()
ggplot(df, aes(x = review_scores_rating, y = price, color = neighbourhood_cleansed)) +
geom_point(alpha = 0.4, size = 1) +
labs(title = "Price vs. Review Score by Neighbourhood",
x = "Review Score", y = "Price") +
theme_minimal() +
facet_wrap(~ neighbourhood_cleansed, scales = "free", ncol = 3)
## Warning: Removed 10929 rows containing missing values or values outside the scale range
## (`geom_point()`).
To dive into Copenhagen’s Airbnb landscape, we crafted five distinct visualizations, each spotlighting a different facet of the market. Our first chart, a horizontal bar plot of listing counts, makes it immediately clear that Nørrebro and Vesterbro-Kongens Enghave dominate the scene with nearly 4000 listings apiece, while tucked-away neighborhoods like Brønshøj-Husum and Vanløse each hover below 500. Next, a boxplot of prices reveals that although most areas share similar median price points, Vesterbro-Kongens Enghave boasts a handful of sky-high outliers pushing into the six-figure DKK range, evidence of luxury offerings interspersed among more budget-friendly options.
The geospatial scatter of longitude vs. latitude then brings the data to life on a map, showing tight clusters of listings around Indre By and its neighboring districts, and sparser dots as you move toward the city’s edge. Shifting to host quality, our bar chart of superhost rates confirms that Indre By leads the pack with over 15% superhosts, whereas areas like Østerbro sit closer to 10%, hinting at more polished operations in the heart of the city. Finally, facetted scatter plots of price against review score demonstrate a clear pattern: higher-rated listings generally command higher prices, but the strength of that relationship varies. In Vesterbro-Kongens Enghave you’ll find steep price tags even among average-scoring properties, while Vanløse shows a tighter, more predictable price–rating link, underlining how different neighborhoods cater to distinct guest expectations.
# get the unique neighbourhood names
areas <- unique(df$neighbourhood_cleansed)
# create a colour palette function
pal <- colorFactor("Set3", domain = df$neighbourhood_cleansed)
# initialise the map
m <- leaflet() %>%
addTiles()
# add a layer for each neighbourhood
for (area in areas) {
subdf <- df[df$neighbourhood_cleansed == area, ]
m <- m %>%
addCircleMarkers(
data = subdf,
lng = ~longitude, lat = ~latitude,
color = ~pal(neighbourhood_cleansed),
radius = 4,
opacity = 0.7, fillOpacity = 0.7,
popup = ~paste("Neighbourhood:", neighbourhood_cleansed, "<br>",
"Price:", price, "<br>",
"Review Score:", review_scores_rating),
group = area,
clusterOptions = markerClusterOptions()
)
}
# add the layers control
m <- m %>% addLayersControl(
overlayGroups = areas,
options = layersControlOptions(collapsed = FALSE)
)
# display the map
m
Our interactive leaflet map immediately highlights how tightly Airbnb activity is concentrated in Copenhagen’s core districts. When all layers are toggled on, you can see a dense “cloud” of points around Indre By, Vesterbro-Kongens Enghave, and Nørrebro. These neighborhoods host the vast majority of listings, sitting cheek-by-jowl around Nyhavn, the Tivoli Gardens, and the pedestrian streets of Strøget. Just to the west, Frederiksberg and Østerbro form smaller but still substantial clusters, reflecting their residential appeal and good transport links. In contrast, areas like Brønshøj-Husum, Valby, and Vanløse appear as much lighter sprinklings, reminding us that Copenhagen’s short-term rental market thins out as you cross into the quieter suburbs.
Zooming and toggling individual neighborhood layers reveals subtler patterns. Amager Vest and Amager Øst listings trace the coastline and metro line, suggesting a growing appetite for waterfront stays and easy airport access. Marker clusters in those zones often pop up around Bella Center and the new Ørestad developments, pointing to business travelers or convention-goers. The color-coding also makes it obvious that superhost-heavy Indre By has a wide price spread: pale dots from budget rooms sit alongside darker, expensive listings, whereas outlying districts show a more uniform price band. Overall, the map teaches us not just where hosts are, but also hints at why they’re there: proximity to attractions, transit corridors, and the balance between tourists and locals.
# extract the neighborhood_overview column and drop the "Unknown" placeholders
text_data <- df$neighborhood_overview
text_data <- text_data[text_data != "Unknown"]
# build the text corpus
corpus <- Corpus(VectorSource(text_data))
# clean the text: lowercase, strip punctuation, numbers, and English stopwords
corpus <- corpus %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("en")): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
# build the term-document matrix
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
word_freq <- sort(rowSums(m), decreasing = TRUE)
# convert to a data frame
word_df <- data.frame(word = names(word_freq), freq = word_freq)
# generate the word cloud
set.seed(699)
wordcloud(words = word_df$word, freq = word_df$freq,
min.freq = 5,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
## Warning in wordcloud(words = word_df$word, freq = word_df$freq, min.freq = 5, :
## neighbourhood could not be fit on page. It will not be plotted.
## [62 similar warnings omitted: additional words could not be fit on the page and were not plotted]
A. Modeling Process
We began by selecting variables grounded in both theory and data availability: host acceptance rate (a proxy for host responsiveness), neighborhood (to capture location premium), room type, and key listing characteristics (accommodates, bathrooms, bedrooms, minimum nights, and average review score). Where variables overlap heavily, such as property type and room type, we retained only one. We transformed the outcome with a natural logarithm, log(price), to stabilize variance and improve linearity with the predictors, since price is on a much larger scale than the other numeric variables. To prepare the data, we dropped any rows with missing values among these fields and converted categorical variables to factors.
We checked pairwise correlations and variance to guard against multicollinearity. For example, we found that accommodates, bedrooms, and bathrooms were moderately correlated but each added distinct explanatory power, so we retained them. We also compared models with and without neighborhood fixed effects; including the ten neighborhood dummies raised adjusted R-squared by nearly 0.05, so we kept them to account for locational heterogeneity. We judged our final model’s quality by overall fit (adjusted R-squared), the significance of individual coefficients matching our expectations, and residual diagnostics to ensure approximately normal errors.
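As a sketch, the pairwise-correlation check described above can be run with cor() (toy values here, not model_df):

```r
toy <- data.frame(
  accommodates = c(2, 4, 3, 6, 2, 5),
  bedrooms     = c(1, 2, 1, 3, 1, 2),
  bathrooms    = c(1, 1.5, 1, 2, 1, 2)
)

# Pairwise Pearson correlations; values very close to 1 would flag redundant predictors
cors <- round(cor(toy), 2)
```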
# Select only the variables you listed, and drop rows with any NA
model_df <- df %>%
dplyr::select(
price,
host_acceptance_rate,
neighbourhood_cleansed,
room_type,
accommodates,
bathrooms,
bedrooms,
minimum_nights,
review_scores_rating
) %>%
drop_na() %>%
mutate(
neighbourhood_cleansed = as.factor(neighbourhood_cleansed),
room_type = as.factor(room_type),
log_price = log(price)
)
# Fit the multiple regression
model_price <- lm(log_price ~ . - price, data = model_df)
# View the summary
summary(model_price)
##
## Call:
## lm(formula = log_price ~ . - price, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87709 -0.19364 -0.02646 0.17296 2.54235
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 5.7903781 0.0593842 97.507
## host_acceptance_rate 0.0316928 0.0102218 3.100
## neighbourhood_cleansedAmager Vest 0.0797488 0.0153965 5.180
## neighbourhood_cleansedBispebjerg -0.1114922 0.0187064 -5.960
## neighbourhood_cleansedBrønshøj-Husum -0.2905096 0.0280438 -10.359
## neighbourhood_cleansedFrederiksberg 0.1334874 0.0149293 8.941
## neighbourhood_cleansedIndre By 0.3856567 0.0136777 28.196
## neighbourhood_cleansedNørrebro 0.1003743 0.0137765 7.286
## neighbourhood_cleansedØsterbro 0.1515594 0.0150117 10.096
## neighbourhood_cleansedValby -0.1183802 0.0202857 -5.836
## neighbourhood_cleansedVanløse -0.1717469 0.0239928 -7.158
## neighbourhood_cleansedVesterbro-Kongens Enghave 0.1454514 0.0136350 10.667
## room_typeHotel room -0.5052774 0.1442227 -3.503
## room_typePrivate room -0.5538080 0.0115998 -47.743
## room_typeShared room -0.5841692 0.2274887 -2.568
## accommodates 0.0852252 0.0032628 26.120
## bathrooms 0.1205654 0.0115072 10.477
## bedrooms 0.1084345 0.0061636 17.593
## minimum_nights -0.0023280 0.0003196 -7.284
## review_scores_rating 0.1253444 0.0117744 10.646
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## host_acceptance_rate 0.001937 **
## neighbourhood_cleansedAmager Vest 2.26e-07 ***
## neighbourhood_cleansedBispebjerg 2.60e-09 ***
## neighbourhood_cleansedBrønshøj-Husum < 2e-16 ***
## neighbourhood_cleansedFrederiksberg < 2e-16 ***
## neighbourhood_cleansedIndre By < 2e-16 ***
## neighbourhood_cleansedNørrebro 3.43e-13 ***
## neighbourhood_cleansedØsterbro < 2e-16 ***
## neighbourhood_cleansedValby 5.52e-09 ***
## neighbourhood_cleansedVanløse 8.72e-13 ***
## neighbourhood_cleansedVesterbro-Kongens Enghave < 2e-16 ***
## room_typeHotel room 0.000461 ***
## room_typePrivate room < 2e-16 ***
## room_typeShared room 0.010245 *
## accommodates < 2e-16 ***
## bathrooms < 2e-16 ***
## bedrooms < 2e-16 ***
## minimum_nights 3.47e-13 ***
## review_scores_rating < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3215 on 10371 degrees of freedom
## Multiple R-squared: 0.573, Adjusted R-squared: 0.5722
## F-statistic: 732.5 on 19 and 10371 DF, p-value: < 2.2e-16
B. The regression equation predicts the logarithm of nightly price as a sum of an overall baseline plus adjustments for each feature. Concretely, the intercept (≈ 5.79) represents the log-price for an entire home in the reference neighborhood (Amager Øst, the omitted dummy) with zero values on the numeric predictors. Each coefficient then shows how a one-unit increase in that predictor shifts log-price: for example, an extra bathroom adds about 0.121 to log-price (about 13 % higher price), each additional bedroom adds 0.108 (≈ 11 % more), and accommodating one more guest adds 0.085 (≈ 9 % more). Categorical effects work the same way on the log scale: being in Indre By adds 0.386 (≈ 47 % premium over the reference neighborhood), whereas listing a private room subtracts 0.554 (≈ 42 % discount relative to an entire home). Host acceptance rate and review score also boost price: higher responsiveness and better reviews translate into noticeably higher nightly rates.
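The percent effects quoted above can be verified by exponentiating the coefficients. A small standalone Python check (not part of the R pipeline; coefficient values copied from the summary output):

```python
import math

# Selected coefficients copied from summary(model_price), on the log-price scale
coefs = {
    "bathrooms": 0.1205654,
    "bedrooms": 0.1084345,
    "accommodates": 0.0852252,
    "Indre By": 0.3856567,
    "Private room": -0.5538080,
}

# A one-unit change multiplies price by exp(beta);
# the percent change is therefore 100 * (exp(beta) - 1)
pct = {name: 100 * (math.exp(b) - 1) for name, b in coefs.items()}
for name, p in pct.items():
    print(f"{name}: {p:+.1f}%")
```

The exact percentages (12.8 %, 11.5 %, 8.9 %, +47.1 %, -42.5 %) match the rounded figures in the interpretation above.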
C. Model Performance
With an adjusted R² of 0.572, our model explains about 57 % of the variation in nightly prices on the log scale, a respectable fit for cross-sectional listing data. The F-statistic of 732.5 (p < 2.2 × 10^-16) confirms that the predictors collectively contribute significant explanatory power beyond an intercept-only model. The residual standard error of 0.322 on log prices corresponds to a typical multiplicative error of exp(0.322) ≈ 1.38, meaning our predictions typically land within a factor of about 1.4 of the true price.
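The reported fit statistics are internally consistent, which can be checked with a standalone Python sketch (n is recovered as the residual df 10371 plus the 20 estimated coefficients):

```python
import math

r2 = 0.573        # multiple R-squared from the summary
n = 10371 + 20    # residual df + (19 predictor coefficients + intercept)
p = 19            # number of predictor terms
rse = 0.3215      # residual standard error on the log scale

# Adjusted R-squared penalizes the fit for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# On the log scale the RSE corresponds to a multiplicative error factor
error_factor = math.exp(rse)

print(f"adjusted R^2 = {adj_r2:.4f}, typical error factor = {error_factor:.2f}")
```

This reproduces the printed adjusted R² of 0.5722 and the ≈ 1.38 multiplicative error band discussed above.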
However, there is still substantial unexplained variance, likely driven by unobserved factors such as interior quality, seasonality, or dynamic demand conditions. A more flexible modeling approach like tree-based ensembles might capture nonlinearity better. Nevertheless, our linear model offers clear, interpretable insights into how location, room type, and host/listing attributes shift expected price, providing a solid foundation for further refinement.
# 1. Build the binary target and make it a factor with explicit levels
df_knn <- df %>%
filter(!is.na(amenities)) %>%
mutate(
has_wifi = factor(grepl("internet|wifi", amenities, ignore.case = TRUE),
levels = c(FALSE, TRUE))
) %>%
select(has_wifi, price, accommodates, bathrooms, bedrooms, beds, review_scores_rating) %>%
drop_na()
# 2. Scale the numeric predictors
predictors <- scale(df_knn %>% select(-has_wifi))
knn_data <- bind_cols(has_wifi = df_knn$has_wifi, as.data.frame(predictors))
# 3. Split into training and test sets
set.seed(699)
idx <- createDataPartition(knn_data$has_wifi, p = 0.7, list = FALSE)
train <- knn_data[idx, ]; test <- knn_data[-idx, ]
train_x <- train %>% select(-has_wifi)
test_x <- test %>% select(-has_wifi)
train_y <- train$has_wifi
test_y <- test$has_wifi
# 4. Train and predict
k <- round(sqrt(nrow(train)))
pred_y <- knn(train = train_x, test = test_x, cl = train_y, k = k)
# 5. Ensure pred_y uses the same factor levels
pred_y <- factor(pred_y, levels = levels(test_y))
# 6. Confusion matrix
confusionMatrix(data = pred_y, reference = test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 0 0
## TRUE 74 3156
##
## Accuracy : 0.9771
## 95% CI : (0.9713, 0.982)
## No Information Rate : 0.9771
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.97709
## Prevalence : 0.02291
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : FALSE
##
For our first classification task, we chose Wi-Fi availability as the binary outcome, since it is one of the most basic amenities that guests expect.
We created the has_wifi indicator by searching the raw amenities text for “internet” or “wifi,” then encoded it as a factor with levels FALSE (no Wi-Fi) and TRUE (has Wi-Fi). To predict this indicator, we selected six numeric features that are both commonly reported and likely correlated with hosting quality: nightly price, accommodates, bathrooms, bedrooms, beds, and overall review_scores_rating. We scaled all predictors to zero mean and unit variance, set the seed to 699 for reproducibility, and split the data 70/30 into training and test sets.
We then fit knn(), predicted on the held-out 30 %, and built a confusion matrix. The model achieves 97.71 % accuracy, but this exactly matches the naive benchmark of always predicting TRUE, since over 97 % of listings offer Wi-Fi. In effect, k-NN simply learned the overwhelming majority class: sensitivity is 0 %, meaning it never correctly identifies a non-Wi-Fi listing, while specificity is 100 %, meaning it never falsely predicts missing Wi-Fi.
Thus, since nearly every listing offers Wi-Fi, our k-NN model predicts “Yes” for every case, matching the majority-class baseline and yielding no real improvement. Predicting less common amenities would require either features that capture rare cases or models specifically designed for imbalanced outcomes.
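The statistics above can be reproduced by hand from the four cells of the confusion matrix, which makes the majority-class behavior explicit. A standalone Python check (positive class = FALSE, i.e. no Wi-Fi, as in the caret output):

```python
# Counts copied from the k-NN confusion matrix
tp = 0       # predicted FALSE, actually no Wi-Fi
fn = 74      # predicted TRUE, actually no Wi-Fi (every non-Wi-Fi listing missed)
fp = 0       # predicted FALSE, actually has Wi-Fi
tn = 3156    # predicted TRUE, actually has Wi-Fi

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
nir = (fp + tn) / total          # no-information rate: share of the majority class (TRUE)
sensitivity = tp / (tp + fn)     # 0: no non-Wi-Fi listing is ever caught
specificity = tn / (tn + fp)     # 1: Wi-Fi listings are never misflagged
balanced = (sensitivity + specificity) / 2

print(f"accuracy = {accuracy:.4f}, NIR = {nir:.4f}, balanced accuracy = {balanced:.2f}")
```

Accuracy equals the no-information rate exactly, and balanced accuracy collapses to 0.5, confirming that the classifier carries no information beyond the class prior.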
set.seed(699)
# 1. Prepare & bin the target
df_nb <- df %>%
filter(!is.na(review_scores_value)) %>%
# equal-frequency binning into tertiles
mutate(
value_bin = ntile(review_scores_value, 3),
value_bin = factor(value_bin,
levels = 1:3,
labels = c("Low", "Medium", "High"))
) %>%
select(-review_scores_value) # drop original
# 2. Select predictors (no other review_scores_*)
df_nb <- df_nb %>%
select(
value_bin,
price,
accommodates,
bedrooms,
bathrooms,
beds,
number_of_reviews,
host_response_time,
host_response_rate,
host_acceptance_rate,
host_is_superhost,
host_identity_verified,
neighbourhood_cleansed,
room_type
) %>%
drop_na() %>%
mutate(
host_response_time = factor(host_response_time),
host_is_superhost = factor(host_is_superhost),
host_identity_verified = factor(host_identity_verified),
neighbourhood_cleansed = factor(neighbourhood_cleansed),
room_type = factor(room_type),
log_price = log(price)
) %>%
select(-price)
# 3. Train/test split
train_idx <- createDataPartition(df_nb$value_bin, p = 0.7, list = FALSE)
train_nb <- df_nb[train_idx, ]
test_nb <- df_nb[-train_idx, ]
# 4. Fit Naive Bayes
model_nb <- naiveBayes(value_bin ~ ., data = train_nb, laplace = 1)
# 5. Predict & evaluate
pred_nb <- predict(model_nb, test_nb)
conf_mat <- confusionMatrix(pred_nb, test_nb$value_bin)
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Medium High
## Low 134 87 42
## Medium 145 301 110
## High 546 432 584
##
## Overall Statistics
##
## Accuracy : 0.428
## 95% CI : (0.408, 0.4481)
## No Information Rate : 0.3465
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1569
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Low Class: Medium Class: High
## Sensitivity 0.16242 0.3671 0.7935
## Specificity 0.91710 0.8366 0.4055
## Pos Pred Value 0.50951 0.5414 0.3739
## Neg Pred Value 0.67375 0.7156 0.8144
## Prevalence 0.34649 0.3444 0.3091
## Detection Rate 0.05628 0.1264 0.2453
## Detection Prevalence 0.11046 0.2335 0.6560
## Balanced Accuracy 0.53976 0.6019 0.5995
We first transformed each listing’s review_scores_value into three equal‐sized bins—Low, Medium, and High, so that our response classes were roughly balanced. To prevent any leakage from other satisfaction measures, we excluded all other review_scores_* columns. We then selected a mix of numeric and categorical predictors that seemed most likely to influence perceived value:
Numeric: log(price), accommodates, bedrooms, bathrooms, beds, number_of_reviews, host_response_rate, host_acceptance_rate
Categorical: host_response_time, host_is_superhost, host_identity_verified, neighbourhood_cleansed, room_type
After dropping rows with any missing values, we again used a 70/30 split (seed 699), trained a Laplace-smoothed Naïve Bayes model on the training set, and evaluated it on the test set. The overall accuracy of 42.8 % significantly exceeds the no-information rate of 34.7 % (p < .001), and Cohen’s κ = 0.157 indicates modest agreement beyond chance.
By class, the model is best at catching High‐value stays (sensitivity = 0.79) but struggles with Low‐value ones (sensitivity = 0.16), suggesting that features like location, host responsiveness, and price are stronger signals of exceptional value than of poor value.
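The accuracy, kappa, and per-class sensitivities quoted above follow directly from the 3×3 confusion matrix. A standalone Python check:

```python
# Rows = predicted (Low, Medium, High), columns = reference, copied from the output
cm = [
    [134,  87,  42],
    [145, 301, 110],
    [546, 432, 584],
]

total = sum(sum(row) for row in cm)
observed = sum(cm[i][i] for i in range(3)) / total   # overall accuracy

# Chance agreement from the row (prediction) and column (reference) marginals
row_sums = [sum(row) for row in cm]
col_sums = [sum(cm[i][j] for i in range(3)) for j in range(3)]
expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2

# Cohen's kappa: agreement beyond chance, rescaled by its maximum
kappa = (observed - expected) / (1 - expected)

# Per-class sensitivity: diagonal cell over its reference-column total
sens_low = cm[0][0] / col_sums[0]
sens_high = cm[2][2] / col_sums[2]

print(f"accuracy = {observed:.3f}, kappa = {kappa:.3f}")
print(f"sensitivity Low = {sens_low:.2f}, High = {sens_high:.2f}")
```

This reproduces the reported 0.428 accuracy, κ = 0.157, and the Low/High sensitivities of 0.16 and 0.79.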
# 1. Build a one‐row data.frame matching train_nb's predictors
new_listing <- data.frame(
host_response_rate = 0.8,
host_response_time = factor("within an hour",
levels = levels(train_nb$host_response_time)),
host_acceptance_rate = 0.9,
neighbourhood_cleansed = factor("Indre By",
levels = levels(train_nb$neighbourhood_cleansed)),
room_type = factor("Entire home/apt",
levels = levels(train_nb$room_type)),
accommodates = 2,
bathrooms = 1,
bedrooms = 1,
beds = 1,
minimum_nights = 2,
number_of_reviews = 100,
host_is_superhost = TRUE,
host_identity_verified = TRUE,
log_price = log(1200) # log of DKK 1,200 per night
)
# 2. Predict with the trained naive Bayes model
predicted_bin <- predict(model_nb, new_listing)
print(predicted_bin)
## [1] Medium
## Levels: Low Medium High
The model assigns our example listing to the “Medium” value bin, showing that guests with similar stays typically report neither poor nor exceptional value for the price they paid. In other words, this listing sits in the middle third of all Copenhagen rentals when it comes to perceived bang-for-the-buck.
Our hypothetical apartment, a two-guest, one-bedroom unit in Indre By priced at DKK 1,200 per night, with a 90 % host acceptance rate, an 80 % response rate, superhost status, and 100 prior reviews, matches the profile of many “average” value stays. The model has learned that moderate pricing combined with solid service indicators and review counts tends to produce a middling perception of value.
To push the perception into the “High” value category, the host could consider either enhancing service (for example, faster or more consistent responses, additional amenities, or personalized touches) or tweaking the nightly rate downward.
df_tree <- df %>%
# Keep only rows where minimum_nights is present
filter(!is.na(minimum_nights)) %>%
# Two classes: short-term <= 7 nights, long-term > 7 nights
mutate(
stay_type = factor(
ifelse(minimum_nights <= 7, "ShortTerm", "LongTerm"),
levels = c("ShortTerm","LongTerm")
)
) %>%
select(-minimum_nights, # drop the original target column
-maximum_nights, # exclude all other min/max nights features
-minimum_minimum_nights,
-maximum_minimum_nights,
-minimum_maximum_nights,
-maximum_maximum_nights)
df_tree <- df_tree %>%
select(
stay_type,
price,
accommodates,
bedrooms,
bathrooms,
beds,
number_of_reviews,
host_response_time,
host_response_rate,
host_acceptance_rate,
host_is_superhost,
host_identity_verified,
neighbourhood_cleansed,
room_type,
review_scores_rating
) %>%
drop_na() %>%
mutate(
host_response_time = factor(host_response_time),
host_is_superhost = factor(host_is_superhost),
host_identity_verified = factor(host_identity_verified),
neighbourhood_cleansed = factor(neighbourhood_cleansed),
room_type = factor(room_type),
log_price = log(price)
) %>%
select(-price)
set.seed(699)
train_idx <- createDataPartition(df_tree$stay_type, p = 0.7, list = FALSE)
train_tree <- df_tree[train_idx, ]
test_tree <- df_tree[-train_idx, ]
# 4.1 Train the model
tree_fit <- rpart(
stay_type ~ .,
data = train_tree,
method = "class",
control = rpart.control(cp = 0.01) # cp can be tuned for pruning
)
# 4.2 Print the model summary
printcp(tree_fit)
##
## Classification tree:
## rpart(formula = stay_type ~ ., data = train_tree, method = "class",
## control = rpart.control(cp = 0.01))
##
## Variables actually used in tree construction:
## [1] log_price neighbourhood_cleansed room_type
##
## Root node error: 170/5560 = 0.030576
##
## n= 5560
##
## CP nsplit rel error xerror xstd
## 1 0.015686 0 1.00000 1.0000 0.075515
## 2 0.010000 6 0.88824 1.0529 0.077423
# 4.3 Plot the classification tree
rpart.plot(tree_fit,
type = 2, # node label style
extra = 104, # show class and probabilities
fallen.leaves = TRUE,
main = "Classification Tree for Short- vs Long-Term Stay")
We began by turning each listing’s minimum-nights requirement into a simple two-level outcome: “ShortTerm” (≤ 7 nights) versus “LongTerm” (> 7 nights), and then stripped out every other “nights” column so the tree couldn’t peek at the raw minimum/maximum values.
After converting all textual fields (neighborhood, room type, host response time, etc.) into factors and taking the natural log of price, we split the data 70/30 (seed 699) and fitted an rpart classification tree with cp = 0.01. We then used the cross-validation results in the cp table to prune at the cp value (0.0157) that minimized the cross-validated error.
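The pruning rule amounts to picking the row of the cp table with the smallest cross-validated error (xerror), mirroring which.min(tree_fit$cptable[, "xerror"]) in the R code. A standalone Python sketch with the two rows from printcp(tree_fit):

```python
# (CP, nsplit, rel error, xerror, xstd) rows copied from printcp(tree_fit)
cptable = [
    (0.015686, 0, 1.00000, 1.0000, 0.075515),
    (0.010000, 6, 0.88824, 1.0529, 0.077423),
]

# Choose the complexity parameter whose row minimizes the cross-validated error
best = min(cptable, key=lambda row: row[3])
opt_cp, n_splits = best[0], best[1]

print(f"optimal cp = {opt_cp}, splits at optimum = {n_splits}")
```

Because the minimum xerror occurs at zero splits, pruning at this cp collapses the tree to its root, which is exactly the degenerate majority-class tree discussed below.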
# Predict on the test set
pred_tree <- predict(tree_fit, test_tree, type = "class")
# Confusion matrix
conf_mat_tree <- confusionMatrix(pred_tree, test_tree$stay_type)
print(conf_mat_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ShortTerm LongTerm
## ShortTerm 2306 65
## LongTerm 3 7
##
## Accuracy : 0.9714
## 95% CI : (0.9639, 0.9778)
## No Information Rate : 0.9698
## P-Value [Acc > NIR] : 0.3434
##
## Kappa : 0.1646
##
## Mcnemar's Test P-Value : 1.389e-13
##
## Sensitivity : 0.99870
## Specificity : 0.09722
## Pos Pred Value : 0.97259
## Neg Pred Value : 0.70000
## Prevalence : 0.96976
## Detection Rate : 0.96850
## Detection Prevalence : 0.99580
## Balanced Accuracy : 0.54796
##
## 'Positive' Class : ShortTerm
##
Our full tree (before pruning) split first on log(price): listings above roughly DKK 665/night were almost always short-term, then on room type (private/shared rooms tend to be short-term) and, for the cheapest entire-place rentals, on specific neighborhoods. On the held-out 30 % test set, this tree reported 97.14 % overall accuracy, essentially identical to the 96.98 % obtained by always guessing “ShortTerm.” Sensitivity for short-term stays was nearly 100 %, but specificity for long-term stays was under 10 %, yielding a balanced accuracy of just 55 %. And when we pruned back to the cp with the lowest cross-validated error, the tree collapsed to a single root node predicting “ShortTerm” for every listing.
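These test-set numbers can be verified by hand from the confusion matrix (positive class = ShortTerm, as in the caret output). A standalone Python check:

```python
# Counts copied from the tree's confusion matrix
tp = 2306   # short-term predicted short-term
fn = 3      # short-term predicted long-term
fp = 65     # long-term predicted short-term
tn = 7      # long-term predicted long-term

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
nir = (tp + fn) / total              # baseline: always guess ShortTerm
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced = (sensitivity + specificity) / 2

print(f"accuracy = {accuracy:.4f} vs NIR = {nir:.4f}")
print(f"sensitivity = {sensitivity:.4f}, specificity = {specificity:.4f}, balanced = {balanced:.4f}")
```

The near-equality of accuracy and the no-information rate, together with a balanced accuracy barely above 0.5, quantifies how little the tree gains over the majority-class guess.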
printcp(tree_fit)
##
## Classification tree:
## rpart(formula = stay_type ~ ., data = train_tree, method = "class",
## control = rpart.control(cp = 0.01))
##
## Variables actually used in tree construction:
## [1] log_price neighbourhood_cleansed room_type
##
## Root node error: 170/5560 = 0.030576
##
## n= 5560
##
## CP nsplit rel error xerror xstd
## 1 0.015686 0 1.00000 1.0000 0.075515
## 2 0.010000 6 0.88824 1.0529 0.077423
opt_index <- which.min(tree_fit$cptable[, "xerror"])
opt_cp <- tree_fit$cptable[opt_index, "CP"]
cat("Optimal pruning CP:", opt_cp, "\n")
## Optimal pruning CP: 0.01568627
pruned_tree <- prune(tree_fit, cp = opt_cp)
rpart.plot(pruned_tree,
type = 2,
extra = 104,
fallen.leaves = TRUE,
main = "Pruned Classification Tree")
Looking at these results, price is the strongest single signal for distinguishing short- versus long-term rentals. But the class imbalance is severe: 97 % of the listings in our sample are short-term by our definition (≤ 7 nights), so the cost of any further splits (on price, room type, etc.) outweighs the tiny gain in correctly identifying the 3 % that are long-term. The pruned tree therefore collapses to the trivial majority-class classifier: always guess ShortTerm.
cluster_df <- df %>%
select(
price,
accommodates,
bedrooms,
bathrooms,
beds,
minimum_nights,
number_of_reviews,
review_scores_rating
) %>%
drop_na() %>%
mutate(log_price = log(price)) %>%
select(-price)
set.seed(699)
cluster_scaled <- scale(cluster_df)
We chose price, accommodates, bedrooms, bathrooms, beds, minimum_nights, number_of_reviews, and review_scores_rating as our clustering variables. As the earlier box plots of price showed, there are many outliers, and such extreme values strongly influence K-means. We therefore take the logarithm of price and use log_price in place of price to dampen the effect of extreme values, and finally standardize all variables.
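To illustrate the log-then-scale preparation, here is a standalone Python sketch with made-up prices (illustrative values only): the standardization mirrors what R's scale() does, and back-transforming the mean of log-price, as in our later cluster summary, yields the geometric mean, which is far less sensitive to an extreme value than the arithmetic mean.

```python
import math
import statistics

prices = [450.0, 600.0, 700.0, 800.0, 9000.0]  # hypothetical nightly prices (DKK)

log_prices = [math.log(p) for p in prices]

# Standardize like R's scale(): subtract the mean, divide by the sample SD
def zscore(xs):
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

scaled = zscore(log_prices)

# exp(mean(log(price))) is the geometric mean, robust to the DKK 9000 outlier
geo_mean = math.exp(statistics.fmean(log_prices))
arith_mean = statistics.fmean(prices)
print(f"arithmetic mean = {arith_mean:.0f}, geometric mean = {geo_mean:.0f}")
```

The single extreme price drags the arithmetic mean to 2310 while the geometric mean stays near 1064, which is why the cluster summary reports exp(mean(log_price)) rather than mean(price).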
fviz_nbclust(cluster_scaled, kmeans, method = "wss") +
labs(title = "Elbow Chart")
Before formal clustering, we plotted an elbow chart to determine k. The total within-cluster sum of squares (WSS) drops sharply as the number of clusters increases up to 3, indicating a substantial improvement in clustering quality; after k = 3 the rate of decrease slows considerably, forming a clear “elbow.” Thus we chose k = 3.
# K-means clustering
k <- 3
km_res <- kmeans(cluster_scaled, centers = k, nstart = 25)
# Add the cluster assignments back to the data frame
df_kmeans <- cluster_df %>%
mutate(cluster = factor(km_res$cluster))
# Run PCA on the scaled data
pca_res <- prcomp(cluster_scaled, center = TRUE, scale. = TRUE)
# Inspect the variance explained by each component
summary(pca_res)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7919 1.0366 1.0010 0.9789 0.85960 0.75520 0.50726
## Proportion of Variance 0.4014 0.1343 0.1252 0.1198 0.09236 0.07129 0.03216
## Cumulative Proportion 0.4014 0.5357 0.6609 0.7807 0.87309 0.94438 0.97654
## PC8
## Standard deviation 0.43319
## Proportion of Variance 0.02346
## Cumulative Proportion 1.00000
# Inspect each variable's contribution to the PCs (loadings matrix)
pca_res$rotation
## PC1 PC2 PC3 PC4
## accommodates 0.507477016 -0.056555401 0.005444997 0.032102303
## bedrooms 0.505924467 -0.003840606 0.018676434 0.009694798
## bathrooms 0.332042566 0.013938186 0.037714295 -0.076076410
## beds 0.472196680 -0.121007124 0.024537613 -0.028626208
## minimum_nights -0.007183963 0.188720317 0.944326442 -0.258826774
## number_of_reviews -0.003891318 -0.709987534 -0.052700959 -0.670657431
## review_scores_rating 0.008939206 0.635981610 -0.313573320 -0.687803113
## log_price 0.391329632 0.194369426 -0.068840635 0.049219530
## PC5 PC6 PC7 PC8
## accommodates 0.16219921 0.12401630 0.373086585 -0.74654279
## bedrooms 0.03551632 0.21977773 0.521542961 0.64962464
## bathrooms -0.91323135 -0.18199382 -0.107456877 -0.06068976
## beds 0.16997399 0.42873060 -0.737244871 0.06881088
## minimum_nights 0.06729556 -0.02999241 0.002821848 -0.01237357
## number_of_reviews 0.07094956 -0.18706944 0.048756743 0.03062257
## review_scores_rating 0.03002787 0.14664147 0.006070546 -0.04004855
## log_price 0.31477476 -0.81362632 -0.176918981 0.09771787
# Visualize the clusters (projected onto the first two PCs)
fviz_cluster(km_res, data = cluster_scaled,
geom = "point", ellipse.type = "norm",
main = "K-means clustering results (PCA)")
# Summary statistics by cluster
df_kmeans %>%
group_by(cluster) %>%
summarise(
sample_num = n(),
avg_price = exp(mean(log_price, na.rm = TRUE)), # back-transform: geometric mean price
avg_score = mean(review_scores_rating, na.rm = TRUE),
avg_accommodates = mean(accommodates, na.rm = TRUE),
avg_bedrooms = mean(bedrooms, na.rm = TRUE),
avg_bathrooms = mean(bathrooms, na.rm = TRUE),
avg_beds = mean(beds, na.rm = TRUE),
avg_minimum_nights = mean(minimum_nights, na.rm = TRUE),
avg_num_reviews = mean(number_of_reviews, na.rm = TRUE)
)
We use k = 3 to divide the data into three clusters with K-means and add the cluster labels back to the data frame. To interpret the clusters visually, we reduce the dimensionality with PCA and examine the explained variance of the principal components and the loadings matrix.
From the variance explained, PC1 accounts for 40.14 % and PC2 for 13.43 %; together they capture over half of the variability in the data. The variables contributing most to PC1 (largest absolute loadings) are accommodates, bedrooms, beds, log_price, and bathrooms, so PC1 mainly represents property size and price level. The variables contributing most to PC2 are number_of_reviews and review_scores_rating, so PC2 mainly represents popularity and rating.
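The variance proportions in the PCA summary can be recovered from the component standard deviations alone: with eight standardized inputs the eigenvalues (squared sds) sum to 8, and each proportion is its eigenvalue over that total. A standalone Python check:

```python
# Component standard deviations copied from summary(pca_res)
sds = [1.7919, 1.0366, 1.0010, 0.9789, 0.85960, 0.75520, 0.50726, 0.43319]

eigenvalues = [s * s for s in sds]
total = sum(eigenvalues)              # ~8: one unit of variance per scaled variable
props = [ev / total for ev in eigenvalues]
cum = props[0] + props[1]

print(f"PC1 = {props[0]:.4f}, PC2 = {props[1]:.4f}, PC1+PC2 = {cum:.4f}")
```

This reproduces the printed proportions (0.4014 and 0.1343) and the cumulative 0.5357 for the first two components.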
Therefore, the three clusters can be described as follows:
Cluster 1 (red): Medium-sized properties with moderate ratings and size.
Cluster 2 (green): Large/high-price properties with significant individual differences.
Cluster 3 (blue): Small/low-price properties, very concentrated.
ggplot(df_kmeans, aes(x = factor(cluster), y = accommodates, color = factor(cluster))) +
geom_boxplot(alpha = 0.6) +
labs(title = "Accommodates Distribution by Cluster", x = "Cluster", y = "Accommodates") +
theme_minimal()
The box plot illustrates the distribution of accommodation capacities
across three clusters. Cluster 2 shows the widest variation in the
number of people accommodated (typically 4-14, with some up to 16),
while cluster 1 generally accommodates 4-6 people and cluster 3
typically hosts 2-4 people.
# Histogram of the rating distribution by cluster
ggplot(df_kmeans, aes(x = review_scores_rating, fill = factor(cluster))) +
geom_histogram(bins = 30, position = "dodge", alpha = 0.7) +
labs(title = "Review Scores Rating Distribution by Cluster", x = "Review Scores Rating", y = "Count") +
theme_minimal() +
facet_wrap(~cluster, ncol = 2)
The histogram shows the distribution of review scores ratings across
three clusters. All clusters have most reviews concentrated in the 4.5-5
range, but cluster 3 has the highest count of 5-star ratings, followed
by cluster 1 and then cluster 2.
ggplot(df_kmeans, aes(x = factor(cluster), y = beds, fill = factor(cluster))) +
geom_violin(alpha = 0.6) +
labs(title = "Beds Distribution by Cluster", x = "Cluster", y = "Number of Beds") +
theme_minimal() +
scale_fill_manual(values = c("skyblue", "orange", "green"))
The violin plot shows the distribution of the number of beds across
three clusters. Cluster 2 has the widest variation in bed numbers,
ranging from near 0 to 15 with a mid-concentration of 4-6. Cluster 1
typically has 1-4 beds, while cluster 3 mostly has 1-2 beds. Cluster 3
shows the most consistent bed counts.
Starting from the raw dataset, we cleaned and prepared it, ran descriptive analyses and visualizations, fit a log-linear price model, carried out three classification tasks (Wi-Fi availability, perceived value, stay length), and distilled listings into three clusters, balancing clarity with rigor throughout.
Throughout the whole process, we found that location commands a notable price premium, Inner City listings run roughly 40–50% above the citywide baseline, while unit characteristics (bedrooms, bathrooms, accommodates) each add another 10–15% on the log scale. Meanwhile, Wi-Fi turns out to be nearly universal (97% of listings), so simple models can’t improve on the trivial “yes” prediction. Moreover, perceived-value bins (low/medium/high) proved modestly predictable, our Naive Bayes model achieved 43% accuracy versus a 35% baseline—suggesting that price, reviews, and host responsiveness drive “high value” marks. Finally, clustering distilled three clear market segments: high-end, review-rich core-city rentals; mid-range family/large-group homes; and budget stays on the periphery.
As a result, hosts and property managers can benchmark their own rates, amenities, and minimum-stay rules against neighborhood peers to fine-tune their positioning—whether they aim to undersell or command a premium. Similarly, guests gain clearer expectations around cost, location, and service quality when choosing between central and suburban stays. In addition, Airbnb itself could integrate these insights into search rankings (promote listings with high predicted value scores) or automated alerts (notify hosts whose response rates fall below neighborhood medians). Beyond that, urban planners and tourism boards can leverage our cluster maps and word-cloud overviews to identify under-served or over-served areas, thereby helping to balance short-term rental growth with broader affordable-housing goals.
Looking ahead, although our models explain a substantial share of observable variation, there remains untapped signal in interior features, seasonality effects, and dynamic demand curves. Therefore, future work could layer in time-series calendar data, image-based quality metrics, or third-party amenity indexes to sharpen predictions—particularly for rarer cases like non-Wi-Fi listings or long-term stays. Nonetheless, this project demonstrates that even straightforward, transparent techniques can yield actionable insights into a complex urban lodging market.